Introduction¶
In this notebook, I will focus on geospatial techniques using crime data from Kansas City, MO. While the dataset offers extensive opportunities for analysis, this notebook will remain concise, providing a snapshot of key insights. The primary goal is to demonstrate proficiency with tools such as Folium, Plotly, Pandas, and NumPy.
About the Data: Kansas City Police Department (KCPD) Crime Data¶
This dataset contains detailed information about reported crimes in Kansas City, Missouri, for the year 2024. It is designed to support geospatial, temporal, and categorical analyses of crime patterns. You may notice that the map markers cluster on the eastern side of the metro area with a sharp cutoff: Kansas City straddles the Kansas-Missouri state line, and Kansas City, Kansas is served by a separate police department, so this dataset covers only the Missouri side. The data includes information on the types of crimes, their locations, the individuals involved, and temporal details. This is real-world data, updated weekly, available at https://data.kcmo.org/Crime/KCPD-Crime-Data-2024/isbe-v4d8/about_data.
Columns and Descriptions¶
- Report_No:
The unique identifier assigned to each crime report.
- Reported_Date:
The date when the crime was officially reported.
- Reported_Time:
The exact time when the crime was reported (in 24-hour format).
- From_Date:
The start date of the crime incident, indicating when it was first observed.
- From_Time:
The start time of the crime incident.
- To_Date:
The end date of the crime incident, if applicable. Null values indicate ongoing or open cases.
- To_Time:
The end time of the crime incident, if applicable.
- Offense:
The general category of the offense (e.g., Assault, Burglary).
- IBRS:
The code corresponding to the Incident-Based Reporting System, used for federal crime reporting.
- Description:
A detailed description of the crime (e.g., Motor Vehicle Theft, Trespass).
- Beat:
The patrol beat or subregion within the police department's jurisdiction.
- Address:
The address where the crime occurred, anonymized to maintain privacy.
- City:
The city where the crime occurred (e.g., Kansas City).
- Zip Code:
The postal code of the crime's location.
- Rep_Dist:
The reporting district, a smaller sub-area for statistical crime tracking.
- Area:
The patrol division or geographic area within the police department (e.g., CPD, SCP).
- DVFlag:
A Boolean flag indicating whether the crime is domestic violence-related.
- Involvement:
The role of the individual involved in the crime (e.g., Victim (VIC), Suspect (SUS)).
- Race:
The race of the involved individual (if applicable).
- Sex:
The gender of the involved individual (e.g., Male (M), Female (F)).
- Age:
The age of the involved individual.
- Fire Arm Used Flag:
A Boolean flag indicating whether a firearm was involved in the incident.
- Location:
The latitude and longitude coordinates of the crime's location (e.g., POINT (-94.5856 39.06537)).
- Age_Range:
The age range of the involved individual, categorized into brackets (e.g., 18-24, 25-34).
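As an aside, brackets like those in Age_Range can be reproduced from a raw age with pandas' `pd.cut`; the bin edges below are an assumption for illustration, not necessarily KCPD's official ones:

```python
import pandas as pd

# Hypothetical ages; bin edges are assumed for illustration only
ages = pd.Series([19, 24, 30, 56, 70])
bins = [0, 17, 24, 34, 44, 54, 64, 120]
labels = ['0-17', '18-24', '25-34', '35-44', '45-54', '55-64', '65+']
print(pd.cut(ages, bins=bins, labels=labels).tolist())
# ['18-24', '18-24', '25-34', '55-64', '65+']
```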
IMPORTING LIBRARIES AND DATA¶
!pip install folium branca
import folium
from folium.plugins import HeatMap
import matplotlib.pyplot as plt
import plotly.express as px
import numpy as np
import pandas as pd
import warnings
import termcolor
import plotly.io as pio
pio.renderers.default = 'notebook_connected'
warnings.filterwarnings('ignore')
print('libraries installed!')
df = pd.read_csv('KCPD_Crime_Data_2024_20241201.csv')
#The default max column output is limited and this snippet allows me to render every column when a dataframe is called.
pd.options.display.max_columns = None
#Ensure the dataset imported correctly.
df.head()
| Report_No | Reported_Date | Reported_Time | From_Date | From_Time | To_Date | To_Time | Offense | IBRS | Description | Beat | Address | City | Zip Code | Rep_Dist | Area | DVFlag | Involvement | Race | Sex | Age | Fire Arm Used Flag | Location | Age_Range | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KC24000270 | 01/02/2024 | 04:53 | 01/02/2024 | 04:53 | NaN | NaN | Trespass of Real Property | 90J | Trespass of Real Property | 113.0 | 1200 MAIN ST | KANSAS CITY | 64106.0 | PJ1087 | CPD | False | VIC | NaN | NaN | NaN | False | NaN | NaN |
| 1 | KC24000567 | 01/03/2024 | 16:30 | 12/31/2023 | 16:30 | NaN | NaN | Stealing from Building/Residence | 23H | All Other Larceny | 132.0 | 3400 MAIN ST | KANSAS CITY | 64111.0 | PJ2753 | CPD | False | VIC | NaN | NaN | NaN | False | POINT (-94.5856 39.06537) | NaN |
| 2 | KC24000877 | 01/04/2024 | 15:15 | 01/04/2024 | 15:15 | NaN | NaN | Vehicular - Non-Injury Hit and Run | NaN | NaN | 123.0 | W I 670 HWY and W I 70 HWY | KANSAS CITY | NaN | NaN | CPD | False | SUS | B | M | 19.0 | False | NaN | 18-24 |
| 3 | KC24001196 | 01/06/2024 | 00:48 | 01/06/2024 | 00:48 | NaN | NaN | Stolen Auto | 240 | Motor Vehicle Theft | 642.0 | 10300 N CHERRY DR | KANSAS CITY | NaN | NaN | SCP | False | CMP VIC | W | M | 24.0 | False | POINT (-94.572100389 39.280825631) | 18-24 |
| 4 | KC24001447 | 01/07/2024 | 07:00 | 01/07/2024 | 07:00 | NaN | NaN | Assault (Aggravated) | 26C | Impersonation | 115.0 | 00 W PERSHING RD | KANSAS CITY | 64108.0 | PJ1831 | CPD | False | VIC OTH | W | M | 56.0 | False | NaN | 55-64 |
DATA PRE-PROCESSING¶
print(df.info())
print(df.isnull().sum())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 95932 entries, 0 to 95931 Data columns (total 24 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Report_No 95932 non-null object 1 Reported_Date 95932 non-null object 2 Reported_Time 95932 non-null object 3 From_Date 95932 non-null object 4 From_Time 95932 non-null object 5 To_Date 31503 non-null object 6 To_Time 31503 non-null object 7 Offense 95932 non-null object 8 IBRS 87350 non-null object 9 Description 87350 non-null object 10 Beat 95926 non-null float64 11 Address 95932 non-null object 12 City 95932 non-null object 13 Zip Code 91243 non-null float64 14 Rep_Dist 86051 non-null object 15 Area 95929 non-null object 16 DVFlag 95932 non-null bool 17 Involvement 95932 non-null object 18 Race 83724 non-null object 19 Sex 84772 non-null object 20 Age 69651 non-null float64 21 Fire Arm Used Flag 95932 non-null bool 22 Location 93795 non-null object 23 Age_Range 69651 non-null object dtypes: bool(2), float64(3), object(19) memory usage: 16.3+ MB None Report_No 0 Reported_Date 0 Reported_Time 0 From_Date 0 From_Time 0 To_Date 64429 To_Time 64429 Offense 0 IBRS 8582 Description 8582 Beat 6 Address 0 City 0 Zip Code 4689 Rep_Dist 9881 Area 3 DVFlag 0 Involvement 0 Race 12208 Sex 11160 Age 26281 Fire Arm Used Flag 0 Location 2137 Age_Range 26281 dtype: int64
Observations
The time and date columns are not in datetime format.
Several columns contain a large number of null values that need to be accounted for.
#Ensure the dates are in datetime format for accurate temporal analysis.
df['Reported_Date'] = pd.to_datetime(df['Reported_Date'])
df['From_Date'] = pd.to_datetime(df['From_Date'])
#Remove the colons to simplify the time conversion.
df[['Reported_Time', 'From_Time']] = df[['Reported_Time', 'From_Time']].replace(':', '', regex=True)
#Convert integer to string with leading zeros and format to HH:MM
df['Reported_Time'] = df['Reported_Time'].apply(lambda x: f"{int(x):04d}") #Ensure 4-digit format
df['From_Time'] = df['From_Time'].apply(lambda x: f"{int(x):04d}") #Ensure 4-digit format
df['Reported_Time'] = pd.to_datetime(df['Reported_Time'], format='%H%M').dt.time
df['From_Time'] = pd.to_datetime(df['From_Time'], format='%H%M').dt.time
df.head(1)
| Report_No | Reported_Date | Reported_Time | From_Date | From_Time | To_Date | To_Time | Offense | IBRS | Description | Beat | Address | City | Zip Code | Rep_Dist | Area | DVFlag | Involvement | Race | Sex | Age | Fire Arm Used Flag | Location | Age_Range | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KC24000270 | 2024-01-02 | 04:53:00 | 2024-01-02 | 04:53:00 | NaN | NaN | Trespass of Real Property | 90J | Trespass of Real Property | 113.0 | 1200 MAIN ST | KANSAS CITY | 64106.0 | PJ1087 | CPD | False | VIC | NaN | NaN | NaN | False | NaN | NaN |
df[['To_Date','To_Time']].isnull().sum()
To_Date 64429 To_Time 64429 dtype: int64
I'm making the assumption that these values are null due to the cases still being open.
#Keep the columns in datetime format to maintain integrity; the placeholder values can be filtered out later for analysis accuracy.
df['To_Time'] = df['To_Time'].fillna('00:00:00')
df['To_Date'] = pd.to_datetime(df['To_Date'])
df['To_Time'] = df['To_Time'].replace(':', '', regex=True)
#Convert integer to string with leading zeros and format to HH:MM
df['To_Time'] = df['To_Time'].apply(lambda x: f"{int(x):04d}") #Ensure 4-digit format
df['To_Time'] = pd.to_datetime(df['To_Time'], format='%H%M').dt.time
# Replace nulls in To_Date with a placeholder date
df['To_Date'] = df['To_Date'].fillna(pd.Timestamp('2222-01-01'))
df.isnull().sum()
Report_No 0 Reported_Date 0 Reported_Time 0 From_Date 0 From_Time 0 To_Date 0 To_Time 0 Offense 0 IBRS 8582 Description 8582 Beat 6 Address 0 City 0 Zip Code 4689 Rep_Dist 9881 Area 3 DVFlag 0 Involvement 0 Race 12208 Sex 11160 Age 26281 Fire Arm Used Flag 0 Location 2137 Age_Range 26281 dtype: int64
ibrs_with_nulls = df[df['IBRS'].isnull()]
ibrs_null_counts = ibrs_with_nulls.groupby('Offense')['IBRS'].size()
print(ibrs_null_counts) #Very difficult to discern why these values are missing.
Offense
Abuse of a Child 13
Alcohol Influence Report 23
Animal Bite 19
Assault (Aggravated) 17
Assault (Aggravated) on Department Member/Outside Law Enforcement Officer 9
...
Vehicular - Injury Hit and Run 154
Vehicular - Non-Injury 522
Vehicular - Non-Injury Hit and Run 301
Violation of Ex-Parte Order of Protection 62
Violation of Full Order of Protection 79
Name: IBRS, Length: 100, dtype: int64
df['Rep_Dist'].value_counts()
Rep_Dist
PP0321 1758
PJ3601 1271
PJ4990 635
PC0323 557
PJ2650 476
...
PJ2037 1
PJ4689 1
PJ7058 1
PJ6925 1
PJ5160 1
Name: count, Length: 6532, dtype: int64
rdist_with_nulls = df[df['Rep_Dist'].isnull()]
rdist_null_counts = rdist_with_nulls.groupby('Area')['Rep_Dist'].size()
print(rdist_null_counts)
Area CPD 2642 EPD 1443 MPD 1898 NPD 1001 OSPD 420 SCP 1041 SPD 1436 Name: Rep_Dist, dtype: int64
#Replacing null values in the columns with "UNSPECIFIED" because it will have minimal effect on this analysis.
df[['IBRS','Description','Beat','Zip Code','Rep_Dist','Area','Location']] = df[['IBRS','Description','Beat','Zip Code','Rep_Dist','Area','Location']].fillna('UNSPECIFIED')
df.isnull().sum()
Report_No 0 Reported_Date 0 Reported_Time 0 From_Date 0 From_Time 0 To_Date 0 To_Time 0 Offense 0 IBRS 0 Description 0 Beat 0 Address 0 City 0 Zip Code 0 Rep_Dist 0 Area 0 DVFlag 0 Involvement 0 Race 12208 Sex 11160 Age 26281 Fire Arm Used Flag 0 Location 0 Age_Range 26281 dtype: int64
print(df['Race'].unique())
print(df['Sex'].unique())
print(df['Age'].unique())
[nan 'B' 'W' 'U' 'I'] [nan 'M' 'F' 'U'] [nan 19. 24. 56. 26. 45. 68. 42. 53. 61. 37. 36. 38. 20. 41. 30. 33. 66. 47. 62. 25. 63. 28. 22. 46. 48. 70. 18. 31. 29. 32. 21. 51. 49. 60. 39. 34. 40. 77. 27. 35. 55. 23. 43. 50. 76. 57. 44. 54. 64. 72. 52. 59. 58. 71. 78. 69. 82. 65. 79. 73. 75. 67. 88. 85. 74. 81. 83. 87. 91. 86. 80. 98. 90. 84. 93. 95. 92. 89. 94. 96. 99. 97.]
#'U' is already used to classify unknown variables. Using 0 as a placeholder allows me to maintain data integrity since a very large portion of the age column has null values.
df['Race'] = df['Race'].fillna('U')
df['Sex'] = df['Sex'].fillna('U')
df['Age'] = df['Age'].fillna(0)
print(df['Age_Range'].isnull().sum())
26281
#These are only null because the age was unknown.
df['Age_Range'] = df['Age_Range'].fillna('UNSPECIFIED')
df.isnull().sum()
Report_No 0 Reported_Date 0 Reported_Time 0 From_Date 0 From_Time 0 To_Date 0 To_Time 0 Offense 0 IBRS 0 Description 0 Beat 0 Address 0 City 0 Zip Code 0 Rep_Dist 0 Area 0 DVFlag 0 Involvement 0 Race 0 Sex 0 Age 0 Fire Arm Used Flag 0 Location 0 Age_Range 0 dtype: int64
All 0s means all null values have been accounted for.
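This check can also be made programmatic so that a regression is caught immediately; a minimal sketch on a toy frame standing in for the cleaned df:

```python
import pandas as pd

# Toy stand-in for the cleaned dataframe
clean = pd.DataFrame({'Race': ['U', 'B'], 'Age': [0.0, 19.0]})
# Total null count across every column should be zero after cleaning
assert clean.isnull().sum().sum() == 0, 'unhandled nulls remain'
print('no nulls remaining')
```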
display(df['Location'])
0 UNSPECIFIED
1 POINT (-94.5856 39.06537)
2 UNSPECIFIED
3 POINT (-94.572100389 39.280825631)
4 UNSPECIFIED
...
95927 POINT (-94.54769 39.03518)
95928 POINT (-94.59937 39.00477)
95929 POINT (-94.58967 39.03483)
95930 UNSPECIFIED
95931 POINT (-94.542065027 39.017103984)
Name: Location, Length: 95932, dtype: object
Observation
- I need to separate the longitude and latitude to make using Folium easier.
#Step 1: Handle "UNSPECIFIED" and keep those rows as NaN in the new column. Strings in these columns would be problematic for Pandas functions.
df['Lat_Lon'] = df['Location'].where(~df['Location'].str.contains('UNSPECIFIED'), np.nan)
#Step 2: Remove the word "POINT" and parentheses for valid rows
df['Lat_Lon'] = df['Lat_Lon'].str.replace('POINT ', '', regex=False).str.strip('()')
#Step 3: Split into Longitude and Latitude (WKT points list longitude first)
df[['Longitude', 'Latitude']] = df['Lat_Lon'].str.split(' ', expand=True)
#Step 4: Convert Latitude and Longitude to floats
df['Longitude'] = pd.to_numeric(df['Longitude'], errors='coerce')
df['Latitude'] = pd.to_numeric(df['Latitude'], errors='coerce')
#Step 5: Drop the 'Lat_Lon' column
df = df.drop(columns=['Lat_Lon'])
display(df.head())
| Report_No | Reported_Date | Reported_Time | From_Date | From_Time | To_Date | To_Time | Offense | IBRS | Description | Beat | Address | City | Zip Code | Rep_Dist | Area | DVFlag | Involvement | Race | Sex | Age | Fire Arm Used Flag | Location | Age_Range | Longitude | Latitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KC24000270 | 2024-01-02 | 04:53:00 | 2024-01-02 | 04:53:00 | 2222-01-01 | 00:00:00 | Trespass of Real Property | 90J | Trespass of Real Property | 113.0 | 1200 MAIN ST | KANSAS CITY | 64106.0 | PJ1087 | CPD | False | VIC | U | U | 0.0 | False | UNSPECIFIED | UNSPECIFIED | NaN | NaN |
| 1 | KC24000567 | 2024-01-03 | 16:30:00 | 2023-12-31 | 16:30:00 | 2222-01-01 | 00:00:00 | Stealing from Building/Residence | 23H | All Other Larceny | 132.0 | 3400 MAIN ST | KANSAS CITY | 64111.0 | PJ2753 | CPD | False | VIC | U | U | 0.0 | False | POINT (-94.5856 39.06537) | UNSPECIFIED | -94.5856 | 39.065370 |
| 2 | KC24000877 | 2024-01-04 | 15:15:00 | 2024-01-04 | 15:15:00 | 2222-01-01 | 00:00:00 | Vehicular - Non-Injury Hit and Run | UNSPECIFIED | UNSPECIFIED | 123.0 | W I 670 HWY and W I 70 HWY | KANSAS CITY | UNSPECIFIED | UNSPECIFIED | CPD | False | SUS | B | M | 19.0 | False | UNSPECIFIED | 18-24 | NaN | NaN |
| 3 | KC24001196 | 2024-01-06 | 00:48:00 | 2024-01-06 | 00:48:00 | 2222-01-01 | 00:00:00 | Stolen Auto | 240 | Motor Vehicle Theft | 642.0 | 10300 N CHERRY DR | KANSAS CITY | UNSPECIFIED | UNSPECIFIED | SCP | False | CMP VIC | W | M | 24.0 | False | POINT (-94.572100389 39.280825631) | 18-24 | -94.5721 | 39.280826 |
| 4 | KC24001447 | 2024-01-07 | 07:00:00 | 2024-01-07 | 07:00:00 | 2222-01-01 | 00:00:00 | Assault (Aggravated) | 26C | Impersonation | 115.0 | 00 W PERSHING RD | KANSAS CITY | 64108.0 | PJ1831 | CPD | False | VIC OTH | W | M | 56.0 | False | UNSPECIFIED | 55-64 | NaN | NaN |
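As an alternative to the multi-step split above, the WKT points could be pulled apart in a single pass with a regex via `str.extract`; a minimal sketch on toy values (the pattern is an assumption matching the 'POINT (lon lat)' format shown in the data):

```python
import pandas as pd

loc = pd.Series(['POINT (-94.5856 39.06537)', 'UNSPECIFIED'])
# Named capture groups become the output columns; non-matches yield NaN
coords = loc.str.extract(
    r'POINT \((?P<Longitude>-?\d+\.?\d*) (?P<Latitude>-?\d+\.?\d*)\)'
).astype(float)
print(coords)
```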
This is the end of the pre-processing
GEOSPATIAL ANALYSIS¶
#Filter out the rows with null coordinates (copy to avoid SettingWithCopyWarning on later assignments)
geo_data = df.dropna(subset=['Latitude', 'Longitude']).copy()
geo_data['Latitude'] = pd.to_numeric(geo_data['Latitude'], errors='coerce')
geo_data['Longitude'] = pd.to_numeric(geo_data['Longitude'], errors='coerce')
Crime Heatmap¶
- Insights:
- Highlights overall crime density across Kansas City, with areas of higher activity visually intensified.
- Applications:
- Useful for police resource allocation.
- Can inform community awareness campaigns about high-crime areas.
#Create a map centered around Kansas City, MO
center_lat = geo_data['Latitude'].mean()
center_lon = geo_data['Longitude'].mean()
crime_map = folium.Map(location=[center_lat, center_lon], zoom_start=11)
#Create the data for the heatmaps
heatmap_data = geo_data[['Latitude', 'Longitude']].values.tolist()
HeatMap(heatmap_data, radius=8).add_to(crime_map)
crime_map.save('crime_heatmap.html')
crime_map
Observations:¶
- High-density crime areas are clustered in specific zones, likely urban centers or regions with higher foot traffic.
- Outlying areas show significantly less crime, suggesting suburban or rural safety.
- The heatmap effectively identifies regions for priority policing or interventions.
Crime Hotspot Map (with Markers)¶
- Insights:
- Specific markers indicate hotspots with detailed popups showing the number of incidents per area.
- Applications:
- Enhances the ability to drill down into critical zones for actionable insights.
- Can guide decision-making for deploying patrol units or crime prevention initiatives.
# Group by 'Area' to find top hotspots
hotspots = geo_data.groupby('Area').size().sort_values(ascending=False)
print("Hotspots by Incident Count:")
print(hotspots)
#Add markers for the top hotspots
for area, count in hotspots.items():
#Get the mean coordinates for each area
coords = geo_data[geo_data['Area'] == area][['Latitude', 'Longitude']].mean()
#Add marker to the map
folium.Marker(
location=[coords['Latitude'], coords['Longitude']],
popup=f"Area: {area}<br>Incidents: {count}",
icon=folium.Icon(color='red', icon='info-sign')
).add_to(crime_map)
#Add layer control for visibility enhancement
folium.LayerControl().add_to(crime_map)
crime_map.save('crime_hotspots_map.html')
crime_map
Hotspots by Incident Count: Area CPD 27471 EPD 21862 MPD 18554 SPD 9419 NPD 8747 SCP 7396 OSPD 343 UNSPECIFIED 3 dtype: int64
Observations:¶
- The markers provide precise geographic details for the most incident-heavy areas.
- Popups displaying the number of incidents make it easy to identify the severity of each hotspot.
- Spatial clustering around markers may indicate systemic issues in these areas, such as socioeconomic disparities or inadequate policing.
Layered Heatmaps by Area¶
- Insights:
- Individual heatmap layers allow toggling between areas for a focused analysis of specific regions.
- Applications:
- Helps stakeholders prioritize resources based on area-specific trends.
- Enables dynamic visualization in presentations or dashboards.
#Create a base map
crime_map = folium.Map(location=[39.0997, -94.5786], zoom_start=11)
#Loop through each area and create a separate heatmap layer
for area, group in geo_data.groupby('Area'):
#Skip if there's no location data
if group[['Latitude', 'Longitude']].isnull().any(axis=None):
continue
#FeatureGroup
area_group = folium.FeatureGroup(name=f"Area: {area}")
#Add a heatmap to the FeatureGroup
HeatMap(
group[['Latitude', 'Longitude']].dropna().values.tolist(),
radius=10, blur=12, min_opacity=0.4
).add_to(area_group)
# Add the FeatureGroup to the map
crime_map.add_child(area_group)
#Add LayerControl for toggling areas
folium.LayerControl().add_to(crime_map)
crime_map.save('layered_heatmap_by_area.html')
display(crime_map)
#Plot the data in a bargraph to see the numbers
division_counts = df['Area'].value_counts()
fig = px.bar(
x=division_counts.index,
y=division_counts.values,
color=division_counts.index,
labels={'x': 'Patrol Division', 'y': 'Number of Incidents'},
title='Crime Incidents by Patrol Division',
text=division_counts.values
)
fig.update_traces(
textposition='outside',
textfont=dict(size=10, color='black')
)
fig.update_layout(
xaxis_title="Patrol Division",
yaxis_title="Number of Incidents",
margin=dict(t=50, b=100),
xaxis=dict(tickangle=-45)
)
fig.show()
Observations:¶
- Heatmaps for individual areas show unique crime patterns, which may vary by the local environment or demographics.
- Some areas exhibit significant crime density even within smaller neighborhoods, indicating localized issues.
- Interactive toggling allows deeper insights without overwhelming visual clutter.
Combined Gender-Specific Heatmaps¶
- Insights:
- Separate heatmaps for male- and female-related crimes, using distinct gradients for clarity.
- Applications:
- Provides insights into gender-based crime distribution.
- Supports gender-focused interventions and policy-making.
gradients = {
'M': {0.4: 'steelblue', 0.7: 'blue', 1: 'darkblue'}, # Male: Blue gradient
'F': {0.4: 'pink', 0.7: 'hotpink', 1: 'deeppink'} # Female: Pink gradient
}
gender_map = folium.Map(location=[39.0997, -94.5786], zoom_start=11)
#Add heatmaps for both genders on the same map
for gender in ['M', 'F']:
gender_data = df[df['Sex'] == gender][['Latitude', 'Longitude']].dropna()
#Skip if no valid data
if gender_data.empty:
print(f"No data available for gender: {gender}")
continue
#Add heatmap directly to the map
HeatMap(
gender_data.values.tolist(),
radius=12,
blur=15,
gradient=gradients[gender],
min_opacity=0.5,
).add_to(gender_map)
#LayerControl
folium.LayerControl().add_to(gender_map)
gender_map.save('combined_gender_heatmap.html')
display(gender_map)
#Calculate counts for each gender
gender_counts = df['Sex'].value_counts()
fig = px.bar(
x=gender_counts.index,
y=gender_counts.values,
color=gender_counts.index,
labels={'x': 'Gender', 'y': 'Number of Incidents'},
title='Crime Incidents by Gender',
text=gender_counts.values
)
fig.update_traces(
textposition='outside',
textfont=dict(size=12, color='black')
)
fig.update_layout(
xaxis_title="Gender",
yaxis_title="Number of Incidents",
margin=dict(t=50, b=100),
xaxis=dict(tickangle=0)
)
fig.show()
Observations:¶
- Crime patterns differ by gender, with certain areas showing higher incidents for either males or females.
- Male-related crimes tend to cluster around high-traffic regions, while female-related crimes may indicate domestic or targeted violence hotspots.
- The contrasting gradients make it easy to visualize gendered crime distribution.
Top 5 Crimes Heatmap¶
- Insights:
- Heatmaps for the most frequent crimes (e.g., Assault, Theft) with distinct colors.
- Applications:
- Visual tool to understand the spatial prevalence of specific crimes.
- Facilitates targeted crime prevention strategies.
#No 'Unspecified' allowed
filtered_data = df[df['Description'] != 'UNSPECIFIED']
#Find the top 5 crimes
top_5_crimes = filtered_data['Description'].value_counts().head(5).index
#Filter location data for top 5 crimes
top_5_data = df[df['Description'].isin(top_5_crimes)][['Latitude', 'Longitude', 'Description']].dropna()
#Colors for each crime
gradients = {
'Simple Assault': {0.4: 'lightblue', 0.7: 'blue', 1: 'darkblue'},
'Motor Vehicle Theft': {0.4: 'lightgreen', 0.7: 'green', 1: 'darkgreen'},
'Vandalism/Destruction of Property': {0.4: 'lightcoral', 0.7: 'red', 1: 'darkred'},
'Aggravated Assault': {0.4: 'khaki', 0.7: 'gold', 1: 'darkgoldenrod'},
'Shoplifting': {0.4: 'plum', 0.7: 'purple', 1: 'darkviolet'},
}
#Create the base map
crimes_map = folium.Map(location=[39.0997, -94.5786], zoom_start=11)
#Add a heatmap for each crime
for crime in top_5_crimes:
crimes_data = top_5_data[top_5_data['Description'] == crime][['Latitude', 'Longitude']]
if crimes_data.empty:
continue
HeatMap(
crimes_data.values.tolist(),
radius=10,
blur=8,
min_opacity=0.3,
gradient=gradients.get(crime, {0.4: 'lightblue', 0.65: 'blue', 1: 'darkblue'}),
name=crime
).add_to(crimes_map)
folium.LayerControl().add_to(crimes_map)
crimes_map.save('top_5_crimes_heatmap.html')
display(crimes_map)
#Calculate the counts for the top 5 crimes
top_5_crime_counts = df[df['Description'].isin(top_5_crimes)]['Description'].value_counts()
fig = px.bar(
x=top_5_crime_counts.index,
y=top_5_crime_counts.values,
color=top_5_crime_counts.index,
labels={'x': 'Crime Type', 'y': 'Number of Incidents'},
title='Top 5 Crimes by Frequency',
text=top_5_crime_counts.values
)
fig.update_traces(
textposition='outside',
textfont=dict(size=12, color='black')
)
fig.update_layout(
xaxis_title="Crime Type",
yaxis_title="Number of Incidents",
margin=dict(t=50, b=100),
xaxis=dict(tickangle=-45)
)
fig.show()
Observations:¶
- Each crime type has distinct clustering patterns, suggesting environmental factors contributing to specific crimes.
- For instance, motor vehicle thefts might concentrate near parking lots or highways, while assaults could occur more in residential or nightlife districts.
- The color-coding for each crime type makes cross-comparison straightforward.
Domestic Violence Heatmap¶
- Insights:
- Focuses on areas with high incidences of domestic violence (DVFlag=True).
- Applications:
- Guides community services to address domestic violence hotspots.
- Enables specialized intervention programs.
#Filter data where DVFlag is True
dv_data = geo_data[geo_data['DVFlag'] == True]
dv_data = dv_data.dropna(subset=['Latitude', 'Longitude'])
dv_map = folium.Map(location=[39.0997, -94.5786], zoom_start=11)
heatmap_data = dv_data[['Latitude', 'Longitude']].dropna().values.tolist()
#Add a heatmap for domestic violence incidents
HeatMap(
heatmap_data,
radius=10, blur=12, min_opacity=0.4
).add_to(dv_map)
dv_map.save('dv_flag_heatmap.html')
dv_map
dvflag_counts = df['DVFlag'].value_counts()
custom_labels = {True: 'Domestic Violence', False: 'Non-Domestic Violence'}
fig = px.bar(
x=[custom_labels[val] for val in dvflag_counts.index],
y=dvflag_counts.values,
color=[custom_labels[val] for val in dvflag_counts.index],
labels={'x': 'Incident Type', 'y': 'Number of Incidents'},
title='Incidents by DVFlag',
text=dvflag_counts.values
)
fig.update_traces(
textposition='outside',
textfont=dict(size=12, color='black')
)
fig.update_layout(
xaxis_title="Incident Type",
yaxis_title="Number of Incidents",
margin=dict(t=50, b=100),
xaxis=dict(tickangle=0)
)
fig.show()
Observations:¶
- Domestic violence hotspots tend to appear in residential zones, highlighting areas where such incidents are prevalent.
- These areas may benefit from targeted community outreach programs or increased support services.
- The distribution reinforces the need for data-driven interventions to address domestic violence.
Hotspot Map by Streets¶
- Insights:
- Identifies specific streets within each area with high incident counts.
- Applications:
- Informs street-level safety measures such as increased lighting or surveillance.
- Assists city planning to improve safety infrastructure.
map_center = [geo_data['Latitude'].mean(), geo_data['Longitude'].mean()]
division_streets_map = folium.Map(location=map_center, zoom_start=11)
#Step 1: Extract the street name by stripping the leading house number and any whitespace
geo_data['Street'] = geo_data['Address'].str.replace(r'^\d+\s*', '', regex=True)
#Step 2: Filter out rows with "UNSPECIFIED" in the Street or Area column
geo_data = geo_data[~geo_data['Street'].str.contains("UNSPECIFIED", na=False)]
geo_data = geo_data[~geo_data['Area'].str.contains("UNSPECIFIED", na=False)]
#Step 3: Ensure no null values exist in required columns
geo_data = geo_data.dropna(subset=['Area', 'Latitude', 'Longitude'])
#Step 4: Group by 'Area' and 'Street' to find incident counts
area_street_hotspots = (
geo_data.groupby(['Area', 'Street'])
.size()
.reset_index(name='Incident Count')
.sort_values(by=['Area', 'Incident Count'], ascending=[True, False])
)
#Step 5: Get the top 5 streets per area
top_street_per_area = (
area_street_hotspots.groupby('Area', group_keys=False)
.apply(lambda x: x.head(5))
.reset_index(drop=True)
)
#Add a heatmap layer for each Area's top 5 streets
for area in top_street_per_area['Area'].unique():
area_data = geo_data[geo_data['Area'] == area]
top_streets = top_street_per_area[top_street_per_area['Area'] == area]['Street']
area_data = area_data[area_data['Street'].isin(top_streets)]
heatmap_layer = HeatMap(area_data[['Latitude', 'Longitude']].dropna().values, name=f"Hotspots in {area}")
division_streets_map.add_child(heatmap_layer)
folium.LayerControl().add_to(division_streets_map)
division_streets_map.save('hotspot_map.html')
division_streets_map
#I could not get the Plotly loop working for this, so I switched to Matplotlib subplots
#Find the unique areas
unique_areas = top_street_per_area['Area'].unique()
#Set up the subplot grid
num_areas = len(unique_areas)
fig, axes = plt.subplots(num_areas, 1, figsize=(10, 6.5 * num_areas), constrained_layout=True)
colors = plt.cm.tab20(np.linspace(0, 1, 10))
#Create a bar chart for each area
for i, area in enumerate(unique_areas):
ax = axes[i] if num_areas > 1 else axes
area_data = top_street_per_area[top_street_per_area['Area'] == area]
bars = ax.bar(
area_data['Street'],
area_data['Incident Count'],
color=colors[:len(area_data)],
edgecolor='black'
)
for bar in bars:
height = bar.get_height()
ax.annotate(
f"{height}",
xy=(bar.get_x() + bar.get_width() / 2, height),
xytext=(0, 5),
textcoords="offset points",
ha='center',
va='bottom',
fontsize=10,
color='black',
weight='bold'
)
ax.set_title(f"Top 5 Streets in {area}", fontsize=14)
ax.set_xlabel("Street", fontsize=12)
ax.set_ylabel("Incident Count", fontsize=12)
ax.set_xticks(range(len(area_data['Street'])))
ax.set_xticklabels(area_data['Street'], rotation=45, ha='right')
plt.show()
Observations:¶
- Certain streets consistently show high incident counts, possibly indicating problematic areas such as poorly lit zones or high foot traffic.
- The top streets per area reveal patterns of recurring issues that could be linked to infrastructure or local activity.
- Concentrating efforts on these streets could yield significant safety improvements.
CONCLUSION AND FINAL THOUGHTS¶
The analysis provides an overview of crime patterns in Kansas City, leveraging geospatial and categorical insights. Key findings include:
Crime Distribution:
- High-density crime areas are concentrated in urban centers and specific hotspots, with distinct clusters for different crime types and demographics.
Area-Specific Trends:
- Layered heatmaps and hotspot markers reveal significant variations in crime patterns across areas, enabling targeted interventions and resource allocation.
Gendered Crime Insights:
- Gender-specific heatmaps show differing patterns of male and female-related crimes, with implications for gender-focused safety measures.
Top Crime Types:
- Certain crimes, such as assaults and thefts, dominate the dataset and exhibit distinct spatial clustering, which could inform preventative measures.
Domestic Violence Prevalence:
- Domestic violence hotspots highlight areas in need of focused community support and outreach programs.
Street-Level Analysis:
- Identifying specific streets with high incidents offers actionable insights for localized improvements, such as increased lighting or law enforcement presence.
Visual Insights Through Bar Graphs:
- Bar graphs complement the geospatial analysis by quantifying disparities in patrol divisions, gender-based incidents, and top crime types, providing clarity for decision-making.
Implications:¶
- This analysis emphasizes the importance of data-driven decision-making in law enforcement, community safety, and urban planning.
- Stakeholders can use these findings to allocate resources more effectively, design targeted interventions, and implement long-term safety improvements.
Future Directions:¶
- Integrating predictive modeling to forecast crime trends.
- Combining socioeconomic data for deeper insights into crime drivers.
- Enhancing public awareness by making these visualizations accessible to the community.
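On the predictive-modeling direction, a natural first step would be turning Reported_Date into a weekly incident-count series that a forecaster could train on; a minimal sketch on synthetic dates (column name mirrors the dataset):

```python
import pandas as pd

# Synthetic stand-in for df['Reported_Date']
reports = pd.DataFrame({
    'Reported_Date': pd.to_datetime(
        ['2024-01-01', '2024-01-02', '2024-01-02', '2024-01-09', '2024-01-10']
    )
})
# Weekly incident counts (weeks ending Sunday by default)
weekly = (
    reports.set_index('Reported_Date')
           .resample('W')
           .size()
           .rename('Incidents')
)
print(weekly.tolist())
# [3, 2]
```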
Final Thoughts¶
This notebook began as an exploratory data analysis incorporating Folium maps. However, after a few days it became clear that a more targeted approach was necessary to avoid getting overwhelmed by the data, so I leveraged the available location data and shifted my focus to map visualizations. While this dataset offers opportunities for deeper analysis, this exercise was primarily about improving my comfort with Folium, and it has certainly done that; I look forward to exploring more advanced geospatial techniques in the future.